1 Data Science for Health and Biomedical Sciences Assessment

A report using prescription data from the NHS to investigate and illustrate the spatial and temporal trends in scabies prescriptions in Scotland.

2 Introduction

Scabies is a highly contagious parasitic infection, often spread in crowded living situations. Aside from the severe toll the skin condition can cause on both mental and physical health, severe forms of the disorder can even lead to issues such as heart disease or kidney infections (WHO: Scabies) While a rise in scabies has been reported, including outbreaks within the University of Edinburgh, little data actually exists on the rise of this surprisingly debilitating disease. I’ve met a lot of friends that have had scabies during their time at the university, yet there’s not a lot of quantitative analysis of the data behind this. In this paper, I will be examining:

  • the relationship (if any) between prescriptions of permethrin and malathion, the two most commonly prescribed medications for scabies, and deprivation status, as measured by SIMD quintile
  • the relationship (if any) between prescriptions of permethrin and malathion and overcrowding
  • fluctuations in scabies medication prescription over time

2.1 Preparatory data handling

2.1.1 Loading required libraries

pacman::p_load(knitr, tidyverse,janitor, gt, here, sf, patchwork, purrr, plotly, ggspatial, cowplot)
#a function loading in all of the libraries I use

2.1.2 Data

data_january2024 <- read_csv("https://www.opendata.nhs.scot/dataset/84393984-14e9-4b0d-a797-b288db64d088/resource/d3eaaf84-0f3b-4fb8-9460-e33503095fbe/download/pitc202401.csv") %>% 
  clean_names()

gp_practice_jan2024 <- read_csv("https://www.opendata.nhs.scot/dataset/f23655c3-6e23-4103-a511-a80d998adb90/resource/54a6e1e3-98a3-4e78-be0d-1e6d6ebdde1d/download/practice_contactdetails_jan2024-open-data.csv") %>% 
  clean_names() %>% 
  select(practice_code, data_zone) #the gp practice data will give us the data zones of the gp practices mentioned in the January 202

datazone_pop_clean <- read_csv("https://www.opendata.nhs.scot/dataset/7f010430-6ce1-4813-b25c-f7f335bdc4dc/resource/c505f490-c201-44bd-abd1-1bd7a64285ee/download/dz2011-pop-est_09092022.csv") %>% 
  clean_names() %>% 
  filter(year == 2021,
         sex == "All") %>% 
  select(data_zone, population = all_ages) #this datazone will later be used to give us the most recent population information available for each datazone

scottish_index_deprivation <-st_read(here("data","SG_ScotInedMultipleDeprivation_2020/SG_SIMD_2020.shp"), quiet = TRUE) %>% 
  clean_names() #the SIMD data includes the most recent data about deprivation in Scotland over the different datazones. We will be focusing on the data of the SIMD quintiles, a measure using different variables to categorise each datazone 1-5 based on perceived deprivation, and on house_o_crat, a measure of the percentage of overcrowding in the datazones.
#The last dataset is from [Spatial SIMD data](https://spatialdata.gov.scot/geonetwork/srv/eng/catalog.search#/metadata/02866b0b-66e5-46ab-9b1c-d433dc3c2fae).

2.1.3 Selecting and filtering scabies medication data

per_mal_2024 <- data_january2024 %>% 
  select(gp_practice, bnf_item_description,paid_quantity) %>% 
  filter(str_detect(bnf_item_description, "PERMETH|MALATH")) #selecting only pertinent columns, filtering for scabies medications permethrin and malathion

We’re going to join the data containing prescription information with the dataframe of the datazones assigned to GP practices, and then to datazone population information.

per_mal_with_gps_2024 <- per_mal_2024 %>% 
  left_join(gp_practice_jan2024, by=c("gp_practice" = "practice_code")) %>% 
  left_join(datazone_pop_clean, by="data_zone") #join with gp practices and datazones

GP practices from a ‘community pharmacy’ have the code 9996, “unallocated” practice codes are coded 99997, “hospital” 99998. These would be useful in a temporal analysis, but not a location analysis, so for the further location analysis, we can remove them.

per_mal_datazones_allocated_2024 <- per_mal_with_gps_2024 %>% 
  filter(!gp_practice %in% c(99996, 99997, 99998))

3 Scabies prescriptions by Deprivation Quintiles (SIMD)

We will join the permethrin and malathion prescription data with data on the deprivation of datazones using the SIMD dataset. We will specifically be focusing on the deprivation quintile and the percentage of people living in overcrowded housing.

per_mal_2024_deprivation <- per_mal_datazones_allocated_2024 %>% 
  left_join(scottish_index_deprivation, by="data_zone") %>% 
  select(paid_quantity, data_zone, population, quintilev2, house_o_crat) %>% 
  mutate(scabies_prescriptions_per_1000 = (paid_quantity*1000/population))
#Joining to all SIMD data, creating a new column calculating the number of scabies prescriptions per 1,000 people in the population of a datazone.
prescriptions_by_quintile <- per_mal_2024_deprivation %>% 
  filter(!is.na(quintilev2)) %>% 
  group_by(quintilev2) %>% 
  rename(`SIMD quintile`=quintilev2) 
#filters out quintiles without values, groups by quintile and renames quintilev2
# tabset from: [tabset markdown source] (https://bookdown.org/yihui/rmarkdown-cookbook/html-tabs.html)

3.1 Prescriptions by Quintile

3.1.1 Prescriptions by quintile (Table 1)

A table showing the mean and standard deviation values of scabies prescriptions in Scotland in January 2024, grouped by SIMD quintiled.

prescriptions_by_quintile_table <- prescriptions_by_quintile  %>% 
    summarise(`Mean number of prescriptions` =  mean(scabies_prescriptions_per_1000, na.rm=TRUE), 
          `Standard deviation` = sd(scabies_prescriptions_per_1000, na.rm=TRUE)) %>% 
  gt() %>% 
  fmt_number(columns = c(`Mean number of prescriptions`, `Standard deviation`), decimals = 1) %>% 
  tab_header(
    title = "Permethrin and Malathion prescriptions",
    subtitle = "Averaged over SIMD quintiles")
#creating a table that groups the data by SIMD quintile and calculates the mean and standard deviation of prescriptions per 1,000 people.
prescriptions_by_quintile_table
Permethrin and Malathion prescriptions
Averaged over SIMD quintiles
SIMD quintile Mean number of prescriptions Standard deviation
1 398.5 533.6
2 387.0 380.4
3 408.2 490.8
4 426.9 421.9
5 437.9 504.8

The SIMD quintiles are bands of 20%, with the 20% most deprived regions residing in the 1st SIMD quintile and the 20% least deprived residing in the 5th quintile.

3.1.2 Prescriptions by quintile (Figure 1)

A boxplot representing the prescriptions of permethrin and malathion in Scotland by SIMD quintiles.

plot_prescriptions_quintile <-   as.data.frame(prescriptions_by_quintile) %>% 
  mutate(`SIMD quintile` = as.factor(`SIMD quintile`)) %>%
  ggplot(aes(x=`SIMD quintile`, y=scabies_prescriptions_per_1000, fill=`SIMD quintile`)) +
  geom_boxplot()  +
  scale_fill_brewer(palette="Greens") +
  labs(title="Scabies prescriptions by Deprivation Quintile",
       subtitle = "January 2024",
       caption = "Data from SIMD (Scottish Index of Deprivation)",
       y = "Permethrin and Malathion prescriptions per 1,000 people") +
  theme_minimal()
#plotting the means of prescriptions by changing SIMD quintile to a factor and using a boxplot
ggplotly(plot_prescriptions_quintile)

The SIMD quintiles are bands of 20%, with the 20% most deprived regions residing in the 1st SIMD quintile and the 20% least deprived residing in the 5th quintile.

3.2 Discusssion (Quintiles)

Interestingly, if we look at the above figures, it seems that the prescriptions in the less deprived quintiles (1 being most deprived, 5 being the least) had higher average scabies medication prescriptions. This could be due to a number of reasons, such as:

  • missing data, especially in more deprived areas
  • lack of knowledge about scabies in more deprived area
  • lack of access to scabies treatments in more deprived areas (as measurement of prescription is a very indirect method of tracking the disease itself).

However, the data for permethrin and malathion prescriptions by datazones has very high standard deviations (figure 1). It’s therefore difficult to reach any strong conclusions about deprivation and permethrin/malathion prescriptions in Scotland in January 2024. A significant outlier can be noted in figure 2, where a data point in the 1st SIMD quintile has nearly 6000 prescriptions per 1000 people - i.e. almost 6 prescriptions per person. This datazone is within the Greater Glasgow and Clyde area. While it may reflect some form of an outbreak or an area with high population density/likelihood of a scabies outbreak such as a prison or care home, inspection of the datazone on maps shows it mostly consists of a cemetery (https://mapit.mysociety.org/area/158655.html). Very likely, this is thus a typographical error.

4 The effect of overcrowding on scabies prescriptions

We will also investigate the link between overcrowding, as supplied by SIMD data, and prescriptions of permethrin and malathion.

4.1 Prescription by percentage of housing overcrowding (Figure 2)

per_mal_2024_deprivation$house_o_crat= as.numeric(sub("%", "",per_mal_2024_deprivation$house_o_crat))

prescriptions_by_overcrowding_binned <- per_mal_2024_deprivation %>%
  mutate(overcrowding_bins = cut(per_mal_2024_deprivation$house_o_crat, breaks=c(0,5,10,15,20,25,30,35,40,45), 
labels = c("0-5%", "6-10%", "11-15%", "16-20%", "21-25%", "26-30%", "31-35%", "36-40%", "41-45%"))) %>% 
  filter(!is.na(overcrowding_bins)) %>% 
  group_by(overcrowding_bins) %>% 
  summarise(prescriptions_housing = (sum(paid_quantity)/sum(population))*1000) %>% 
  ungroup()
#we create a new column using house_o_crat, the variable for percentage of housing, and create breaks each 5% using the cut function and assign labels to each of these distinct bins, before filtering by any NA values. We group by these bins and calculate a value for prescriptions per 1,000 in each of the bins.
prescriptions_by_overcrowding_binned_graph <- prescriptions_by_overcrowding_binned %>% ggplot(aes(x=overcrowding_bins, y=prescriptions_housing, fill=overcrowding_bins)) + 
  geom_col() +
  labs(title = "Permethrin and Malathion prescriptions by overcrowding in datazones",
    x = "Overcrowding (%)",
    y = "Permethrin and Malathion prescriptions per 1,000 people",
    fill = "Overcrowding bins") +
  theme_minimal() + 
  scale_fill_brewer(palette="Reds")
#the above code creates a column graph from the data on the permethrin and malathion prescriptions per 1,000 people as related to overcrowding data
ggplotly(prescriptions_by_overcrowding_binned_graph)

Reference: ‘cut’ function

4.2 Discussion (Overcrowding)

When we look at prescriptions per 1000 people, we see that there is quite a significantly greater number of prescriptions in the 41-45% overcrowding datazone group. However, there isn’t a strong trend in the datazones with lower levels of overcrowding. Again, we would need further data to be able to make strong statements.

5 The rise in scabies medication prescription over time

5.1 Introduction

Numerous sources have reported a rise in scabies in Scotland (The Guardian, The Herald, BBC), but there is a lack of quantitative evidence supporting this. I thus decided to investigate the trends in scabies and malathion prescriptions in Scotland between the years 2016-2024, split up by health boards. I chose to investigate health boards rather than datazones partially because of the significant amount of data, and because he focus was on temporal rather than spatial analysis. I chose 2016-2024 as it represented 8 years of some of the most recent data, centered around January because scabies is typically more common in winter. It would be interesting, in further study, to see whether there are spikes in scabies prescriptions during the winter months, and thus do a smaller-scale temporal analysis.

5.2 Preparatory Data Handling

5.2.1 Loading in datasets for January 2016 - January 2024

url = c(
    "https://www.opendata.nhs.scot/dataset/84393984-14e9-4b0d-a797-b288db64d088/resource/7c487db9-f450-4319-8b18-4b825380692b/download/pitc201601.csv",
    "https://www.opendata.nhs.scot/dataset/84393984-14e9-4b0d-a797-b288db64d088/resource/3def845e-c830-4b73-8e02-e42adcde5afa/download/pitc201701.csv",
    "https://www.opendata.nhs.scot/dataset/84393984-14e9-4b0d-a797-b288db64d088/resource/82ef2c55-5a31-4759-ab9a-83b176e107f2/download/pitc201801.csv",
    "https://www.opendata.nhs.scot/dataset/84393984-14e9-4b0d-a797-b288db64d088/resource/3bd6e3cc-b8b7-493b-b6aa-f9fcadbb05d2/download/pitc201901.csv",
    "https://www.opendata.nhs.scot/dataset/84393984-14e9-4b0d-a797-b288db64d088/resource/e5c841f2-3e16-428b-97db-0798ec7a5fb4/download/pitc202001.csv",
   "https://www.opendata.nhs.scot/dataset/84393984-14e9-4b0d-a797-b288db64d088/resource/7722a6e6-a6b6-49ec-aa63-fdc4bc727f05/download/pitc202101.csv",
    "https://www.opendata.nhs.scot/dataset/84393984-14e9-4b0d-a797-b288db64d088/resource/53a53d61-3b3b-4a12-888b-a788ce13db9c/download/pitc202201.csv",
    "https://www.opendata.nhs.scot/dataset/84393984-14e9-4b0d-a797-b288db64d088/resource/6caa7144-f0e4-4ab0-b9e4-8fa10c8edf1c/download/pitc202301.csv",
    "https://www.opendata.nhs.scot/dataset/84393984-14e9-4b0d-a797-b288db64d088/resource/d3eaaf84-0f3b-4fb8-9460-e33503095fbe/download/pitc202401.csv"
  )
combined_data_tidy <- map_df(url, read_csv) %>% 
  clean_names()
#this function maps to a dataframe, taking in the list of urls and the read_csv function and applying the function to each url, then returning a single concatenated dataframe
hb_population_data <- read_csv("https://www.opendata.nhs.scot/dataset/7f010430-6ce1-4813-b25c-f7f335bdc4dc/resource/27a72cc8-d6d8-430c-8b4f-3109a9ceadb1/download/hb2019_pop_est_14102024.csv") %>% 
  clean_names() %>% 
  filter(sex=="All") %>% 
  select(year, hb, population=all_ages) %>% 
  mutate(year = as.character(year)) #added so it's the same type as the year in combined_data and we can join them
scottish_health_boards <- st_read(here("data","SG_NHS_HealthBoards_2019/SG_NHS_HealthBoards_2019.shp"),
  quiet = TRUE) %>% 
  clean_names()

Reference: ‘map_df’ function

5.2.2 Selecting and filtering relevant data for temporal analysis

When we’ve mapped the data from the January of successive years 2016-2024 into a dataframe, we’re going to select the relevant columns, filter out scabies medication, clean up the data and use the function ‘complete’ from tidyr to add 0 values to any hb/year pairs that have no data about scabies prescriptions (to prevent loss of data when merging with spatial data later).

scabies_over_years <- combined_data_tidy %>% 
  clean_names() %>% 
  select(bnf_item_description,paid_quantity,hbt2014,hbt,paid_date_month) %>% 
   filter(str_detect(bnf_item_description, "PERMETH|MALATH")) %>% 
  mutate(hb = coalesce(hbt2014,hbt)) %>% 
  mutate(year=substr(as.character(paid_date_month),1,4)) %>% 
  select(,-hbt2014,-hbt,-paid_date_month) 
#takes data about prescriptions from 2016-2024, uses coalesce with mutate to change the 'hbt2014'/'hbt' column found in some of the data to 'hb' by replacing the NA value in their hb column with the hbt2014/hbt value, then makes a new column that selects just the year from the paid_date_month column, before removing irrelevant columns.

Reference: ‘coalesce’ dplyr

5.2.3 Joining prescription data with population data and calculating prescriptions by 10,000 per datazone

scabies_over_time_pop <- scabies_over_years %>% 
  mutate(join_year = if_else(year == "2024", "2023", year)) %>% 
  left_join(hb_population_data %>% rename(join_year = year), by = join_by(hb, join_year)) %>% 
  mutate(paid_per_10000 = paid_quantity*10000 / population) %>% 
  group_by(hb, year) %>% 
  summarise(permethrin_and_malathion_per_10000 = sum(paid_per_10000, na.rm = TRUE)) %>% 
  ungroup() %>% 
  complete(hb, year, fill=list(permethrin_and_malathion_per_10000 = 0))
#The hb_population data lacks information for 2024, so we're using the most recent population data from 2023 for that year instead. Uses complete to fill healthboards that have no data on prescriptions with 0, so their geometry data is still retained in later joins/filtering.

Reference: ‘complete’ function

5.3 Scabies prescriptions in different datazones over time (Figure 3)

scabies_over_time_line <- scabies_over_time_pop %>% 
  right_join(scottish_health_boards, by = join_by(hb == hb_code)) %>%
  ggplot(aes(x=year, y=permethrin_and_malathion_per_10000, group=hb_name)) +
  geom_line(aes(color=hb_name)) +
  labs(title = "Scabies prescriptions by year", 
       subtitle = "per 10,000 people",
       color= "Health Board",
       x= "Year",
       y = "Permethrin and Malathion presciptions") +
  theme_minimal()
ggplotly(scabies_over_time_line)

5.4 Scabies prescriptions in different datazones, faceted by year (Figure 4)

scabies_over_time_pop_sf <- scabies_over_time_pop %>% 
    right_join(scottish_health_boards, by = join_by(hb == hb_code)) %>%
    st_as_sf()
scabies_maps <- scabies_over_time_pop_sf %>%
  split(.$year) %>% 
  lapply(function(data) {
    ggplot(data, aes(fill = permethrin_and_malathion_per_10000)) +
      geom_sf(colour = "black") +
      scale_fill_viridis_c(
        option = "inferno", 
        direction = -1, 
        name = "Prescriptions of\npermethrin and malathion", 
        limits = c(0, 2000) #added a limit because otherwise, the cowplot plots will each have different scales
      ) +
      ggspatial::annotation_scale(location = "tl", width_hint = 0.5) +
      theme_minimal() +
      labs(
        title = paste("Year:", unique(data$year))
      )
  })
plot_grid(plotlist = scabies_maps, ncol = 2)

#When I used ggplot to plot the maps without faceting, it returned a map with overlaid axes and looked quite distorted. I tried using scales="free", but that's not supported by geom_sf. So, as in <https://stackoverflow.com/questions/44814768/mapping-different-states-in-r-using-facet-wrap>, I ended up using a library called cowplot, by plotting each map separately and then knitting them together. This came with its own issues, such as overlapping labels (so I changed to an overarching label) and the maps having different scale fill ranges, which made it hard to compare. 
#I ended up setting limits for the scale_fill_viridis_c function before the cowplot, so as to ensure they were comparable and had the same scale.
#I considered using ggdraw, following <https://www.geeksforgeeks.org/adding-x-and-y-axis-label-to-ggplot-grid-built-with-cowplot-in-r/>, to set an overarching title/label for the plots rather than to have the same title copied multiple times. However, I recognized that it actually achieved the same desired result to just have an Rmarkdown title.

6 Further Study

The current report has focused only on data from one month (January) over the years 2016-2024. This doesn’t reflect season changes in scabies prescriptions, and could be extended to include multiple months over the years. Further, a significant limitation I’ve touched upon is that while the data reflects an upwards trend in prescriptions for treatment of scabies, prescriptions are only a correlate of actual cases of scabies and may not reflect all instances of the condition, especially untreated scabies cases. Additionally, the prescription data may have some inaccuracies in its reporting, as has already been potentially mentioned in regards to the outlier in figure 2. Aside from expanding the temporal analysis to include more months, we could also attempt to extend it over the years, and expand the spatial analysis from health boards to datazones.

7 A statement on generative AI

I’ve used generative AI (chatGPT) to check over code during the debugging process. This was mainly when I was attempting to create ‘overcrowding bins’ for the data and kept getting errors because I was trying to use the ‘cut’ function wrong. I also got some errors during rendering, which were because it turned out I had accidentally left a quotation mark without closing it when editing the code and then tried to knit without running.

8 References

Boseley, S. (2024, February 18). Return of Victorian-era diseases to the UK: Scabies, measles, rickets, scurvy. The Guardian. Retrieved from https://www.theguardian.com/society/2024/feb/18/return-of-victorian-era-diseases-to-the-uk-scabies-measles-rickets-scurvy World Health Organization. (n.d.). Scabies. Retrieved November 24, 2024, from https://www.who.int/news-room/fact-sheets/detail/scabies Campbell, D. (2024, January 1). Doctors report “nightmare” surge in scabies across UK. The Guardian. Retrieved from https://www.theguardian.com/society/2024/jan/01/doctors-report-nightmare-surge-in-scabies-across-uk Herald Scotland. (2024, January 1). Scabies cases on the rise: Symptoms and treatments. Retrieved from https://www.heraldscotland.com/news/national/uk-today/24680241.scabies-cases-rise---symptoms-treatments/ BBC News. (2023, May 20). Scabies cases rising across the Highlands and Islands. Retrieved from https://www.bbc.co.uk/news/uk-scotland-highlands-islands-65645796